System and method for generating a multimedia summary of multimedia streams

ABSTRACT

A system facilitates and enhances review of one or more multimedia input streams that include some combination of video, audio and text information, generating a multimedia summary, thereby enabling a user to better browse and/or decide on viewing the multimedia input streams in their entirety. The multimedia summary is constructed automatically, based in part on system specifications, user specifications and network and device constraints. In a particular application of the invention, the input multimedia streams represent news broadcasts (e.g., television news programs, video vault footage). In such an application, the invention can enable the user to automatically receive a summary of the news stream in accordance with previously provided user preferences and in accordance with prevailing network and user device constraints.

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. provisional application Ser. No. 60/483,765, filed Jun. 30, 2003, which is incorporated herein by reference.

The present invention relates generally to the summarization of video or motion images, and in particular, to a system and method for providing a multimedia summary (video/audio/text) of a news broadcast to enable a user to better browse and decide on viewing the broadcast.

The amount of video content is expanding at an ever increasing rate. Simultaneously, the available time for viewers to consume or otherwise view all of the desirable video content is decreasing. The increased amount of video content, coupled with the decreasing time available to view it, makes it increasingly problematic for viewers to view all of the potentially desirable content in its entirety. Accordingly, viewers are increasingly selective regarding the video content that they select to view. To accommodate viewer demands, techniques have been developed to provide a summarization of the video that is representative in some manner of the entire video. The typical purpose for creating a video summarization is to obtain a compact representation of the original video for subsequent viewing.

Advances are being made continually in the field of automated story segmentation and identification, as evidenced by the BNE (Broadcast News Editor) and BNN (Broadcast News Navigator) of the MITRE Corporation (Andrew Merlino, Daryl Morey, and Mark Maybury, MITRE Corporation, Bedford, Mass., Broadcast News Navigation using Story Segmentation, ACM Multimedia Conference Proceedings, 1997, pp. 381-389). Using the BNE, newscasts are automatically partitioned into individual story segments, and the first line of the closed-caption text associated with the segment is used as a summary of each story. Key words from the closed-caption text or audio that match the user's search words are determined for each story segment. Based upon the frequency of occurrences of matching keywords, the user selects stories of interest. Similar search and retrieval techniques are becoming common in the art. For example, conventional text searching techniques can be applied to a computer-based television guide, so that a person may search for a particular show title, a particular performer, shows of a particular type, and the like.

A disadvantage of the traditional search and retrieval techniques is the need for an explicit search task, and the corresponding selection among alternatives based upon the explicit search. Often, however, a user does not have an explicit search topic in mind. In a typical channel-surfing scenario, a user does not have an explicit search topic. A channel-surfing user randomly samples a variety of channels for any of a number of topics that may be of interest, rather than specifically searching for a particular topic. That is, for example, a user may initiate a random sampling with no particular topic in mind, and select one of the many channels sampled based upon the topic that was being presented on that channel at the time of sampling. In another scenario, a user may be monitoring the television in a background mode, while performing another task, such as reading or cooking. When a topic of interest appears, the user redirects his focus of interest to the television, then returns his attention to the other task when a less interesting topic is presented.

Accordingly, a technique for automatically generating a multimedia summary that summarizes the video, audio and text portions of a video stream (e.g., a news broadcast), independent of a user having to explicitly use keywords to search for particular news topics, is highly desirable.

The present invention overcomes the shortcomings of the prior art. Generally, the present invention is directed to a system and method for generating a multimedia summary of one or more input video sequences that allows a user to better browse and/or decide on viewing the video sequences in their entirety. The multimedia summary is constructed automatically, based in part on system specifications, user specifications and network and device constraints. In a particular application of the invention, the input video sequences represent news broadcasts.

One feature of the invention is to create a multimedia summary of an input video stream which is suitable for use with a wide variety of devices that range from bandwidth-constrained devices such as PDAs and cell phones to non-bandwidth-constrained devices such as personal computers and multimedia workstations.

Another feature of the invention is to provide flexibility in the manner in which the multimedia summary is constructed. That is, the invention allows the user to customize the multimedia summary to suit the particular user's viewing preferences. More particularly, a user may provide one or more parameters specifying, for example, whether the multimedia summary is to be comprehensive or quick; whether the multimedia summary should include only a summary of a single lead story or a summary of the top lead stories; and whether the summary should include only text, only audio or only video, or combinations thereof. The user may also provide one or more keyword parameters, which will be utilized by the summarization system to select appropriate portions of text, audio and video from the input video stream for inclusion in the multimedia summary.

According to one aspect of the invention, a method for generating a multimedia summary of a news broadcast comprises the acts of: one of receiving and retrieving a multimedia stream comprising video, audio and text information; dividing the multimedia stream into a video sub-stream, an audio sub-stream and a text sub-stream; identifying video, audio and text key elements from said video, audio and text sub-streams, respectively; computing an importance value for the video, audio and text key elements identified at said identifying act; first filtering the identified video, audio and text key elements to exclude those key elements whose associated importance value is less than a pre-defined video, audio and text importance threshold, respectively; second filtering the key elements remaining from said first filtering act in accordance with a user profile; third filtering the key elements remaining from said second filtering act in accordance with network and user device constraints; and outputting a multimedia summary from the key elements remaining from said third filtering act.

Although this invention is particularly well suited to news broadcasts, the principles of this invention also allow a user to receive a multimedia summary of other types of broadcasts as well. For example, the invention is applicable to multimedia summaries of movie videos to allow a user to better browse and decide on viewing the movie in its entirety.

The invention also comprises an article of manufacture for carrying out the method. Other features and advantages of the invention will become apparent from the following detailed description and the appended claims, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic diagram of an overview of an exemplary embodiment of a multimedia summarization system in accordance with the present invention;

FIG. 2 is a flow diagram of a method of summarization in accordance with the present invention;

FIG. 3 illustrates an exemplary video stream of a typical news broadcast;

FIG. 4 is a flow diagram of a method of identifying key elements in accordance with the present invention;

FIG. 5 illustrates an example block diagram of the process of feature extraction and derivation of features from an input multimedia stream;

FIG. 6 is a graph illustrating how the time elements which comprise the audio sub-stream may be grouped to form segments; and

FIGS. 7 a-c are graphs illustrating various ways of identifying key elements.

The present invention is directed to a system and method for summarizing one or more input multimedia streams via three modalities (video, audio, text).

It is to be understood that the exemplary system modules and methods described herein may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Preferably, the present invention is implemented in software as an application program tangibly embodied on one or more program storage devices. The application program may be executed by any machine, device or platform comprising suitable architecture. It is to be further understood that, because some of the constituent system modules and methods depicted in the accompanying Figures are preferably implemented in software, the actual connections between the system components (or the process acts) may differ depending upon the manner in which the present invention is programmed. Given the teachings herein, one of ordinary skill in the art will be able to contemplate or practice these and similar implementations or configurations of the present invention.

The present invention includes a computer program product which is a storage medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the present invention. The computer program product may also include data, e.g., input data, corresponding to any of the processes of the present invention. The storage medium can include, but is not limited to, any type of disk including floppy disks, optical discs, DVDs, CD-ROMs, microdrives, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.

Stored on any one of the computer readable medium (media), the present invention includes software for controlling both the hardware of a general purpose/specialized computer or microprocessor, and for enabling the computer or microprocessor to interact with a human user or other mechanism utilizing the results of the present invention. Such software may include, but is not limited to, device drivers, operating systems, and user applications. Ultimately, such computer readable media further include software for performing the present invention, as described above.

System Architecture:

With reference to FIG. 1, there is shown a schematic overview of one embodiment of a multimedia summarization system 100 in accordance with the present invention. By way of non-limiting example, the multimedia summarization system 100 will be presented in the context of a summarization system 100 for summarizing news stories, although the extension of the principles presented herein to other multimedia applications will be evident to one of ordinary skill in the art.

In the embodiment shown in FIG. 1, the multimedia summarization system 100 receives a multimedia stream 101 as input from a broadcast channel selector 110, for example a television tuner or satellite receiver. The system 100 may also retrieve a pre-stored multimedia stream 102 from a video storage vault 112. The system 100 may also receive input in the form of a video stream, such as from a server on a network. The multimedia input streams 101, 102 may be in digital or analog form, and the broadcast may be any form of media used to communicate the streams 101, 102, including point-to-point communications. In the embodiment shown in FIG. 1, the input multimedia streams 101, 102, by way of non-limiting example, correspond to news broadcasts and include multiple news stories with interspersed advertisements or commercials. The news broadcast could represent, for example, a particular news program such as CNN Headline News, NBC Nightly News, etc.

In the embodiment shown in FIG. 1, the multimedia summarization system 100 comprises a modality recognition and division (MRAD) module 103 for dividing the input multimedia streams 101, 102 into three modalities, referred to hereafter as a video sub-stream 303, an audio sub-stream 305 and a text sub-stream 307. The MRAD module 103 comprises a story segment identifier (SSI) module 103 a, an audio identifier (AI) module 103 b, and a text identifier (TI) module 103 c for processing the input multimedia streams 101, 102 and outputting the video 303, audio 305, and text 307 sub-streams, respectively. The sub-streams 303, 305, 307 are output from the MRAD module 103 to a key element identifier (KEI) module 105 to identify key elements from within the respective sub-streams 303, 305, 307. The KEI module 105 comprises a feature extraction (FE) module 107 and an importance value (IV) module 109. The functionality of the KEI module 105 is described in further detail below in connection with FIGS. 4-7. The output of the KEI module 105 is coupled to the input of the key element filter (KEF) module 111, which filters the key elements identified by the KEI module 105, in a manner to be described below. The surviving key elements output from the KEF module 111 are provided as input to a user profile filter (UPF) module 113, which further filters the surviving key elements in accordance with a pre-determined user preference. As shown, the UPF module 113 is coupled to one or more storage devices (i.e., a user preference database 117) for storing the pre-determined user preferences. The output of the UPF module 113 is coupled to the input of the network and device constraint (NADC) module 115, which may further filter the surviving key elements output from the UPF module 113 in accordance with the prevailing network conditions and user device constraints. The NADC module 115 outputs the multimedia summary 120 of the invention. Typically, the multimedia summary will be requested by a remote user, via a client device 124, interfacing with the summarization system 100 over a network 122 such as the Internet, an intranet or any other suitable network. The client device 124 may be any electronic device operable to connect with and transmit data over the network 122. For example, the client device 124 may include a wired device (e.g., a personal computer, workstation, or fax machine) or a wireless device (e.g., a laptop, personal digital assistant (PDA), mobile phone, pager, smartphone, wearable computing and communicating device or communicator).
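Conceptually, the data flow of FIG. 1 is a filtering pipeline. The following Python sketch is purely illustrative and not part of the disclosure; the per-module behavior is passed in as callables, and the names (KeyElement, summarize, thresholds) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class KeyElement:
    modality: str       # "video", "audio" or "text"
    start: int          # index of the first frame in the segment
    end: int            # index of the last frame in the segment
    importance: float   # importance value assigned by the IV module 109

def summarize(stream,
              divide: Callable,              # stand-in for the MRAD module 103
              identify: Callable,            # stand-in for the KEI module 105
              thresholds: Dict[str, float],  # per-modality thresholds (KEF module 111)
              user_filter: Callable,         # stand-in for the UPF module 113
              constraint_filter: Callable    # stand-in for the NADC module 115
              ) -> List[KeyElement]:
    video, audio, text = divide(stream)
    elements = identify(video, audio, text)
    elements = [e for e in elements if e.importance >= thresholds[e.modality]]
    elements = user_filter(elements)
    elements = constraint_filter(elements)
    return elements    # key elements that make up the multimedia summary 120
```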

Operation:

An overview discussion of one embodiment of the multimedia summarization method of the present invention is now provided, with respect to FIGS. 1-3. Thereafter, more detailed descriptions of various acts associated with the described method will be provided further below.

FIG. 2 is a flow diagram illustrating a method of summarization according to one embodiment of the present invention:

At act 205, the process starts.

At act 210, the multimedia summarization system 100 retrieves and/or receives one or more multimedia streams 101, 102 (e.g., news broadcasts) as input.

At act 215, the retrieved/received input multimedia stream 101 is divided in accordance with three modalities (i.e., video, audio and text).

FIGS. 3 a-3 d illustrate, by way of example, how an input multimedia stream (e.g., stream 101) may be divided in accordance with the three modalities.

FIG. 3 a is a general illustration of an input multimedia stream 101 (e.g., news broadcast 101) comprising video, audio and text components distributed throughout. As stated above, the news broadcast could represent, for example, a particular news program such as CNN Headline News, NBC Nightly News, etc.

FIGS. 3 b-3 d illustrate how the input video stream 101 may be divided according to the three modalities.

Referring first to FIG. 3 b, in accordance with the video modality, a video sub-stream 303 is shown which represents the input multimedia stream 101 processed to highlight news story segmentation. The video sub-stream 303 of FIG. 3 b is shown to be output from the story segment identifier (SSI) sub-module 103 a of the MRAD module 103. The exemplary video sub-stream 303 is divided by the SSI sub-module 103 a into a plurality of video frames (e.g., frames 1-2500), of which only 40 are shown for ease of explanation. The division is based on the typical construction of a news broadcast. That is, the typical news broadcast follows a common format that is particularly well suited for story segmentation. For example, a first or lead story could be related to political events in Washington, and a second news story could be related to a worker strike or a building fire. As shown in FIG. 3 b, after an introduction frame 301 (frame 1), a newsperson, or anchor, typically appears 311 (anchor frames 2-4) and introduces a first reportage 321 (frames 5-24). The anchor frames 2-4 and news story segment frames 5-24 are collectively referred to as a first news story 311, 321. After the news story, the anchor reappears 312 (anchor frames 25-29) to introduce the second reportage 322 (frames 30-39), referred to collectively as the second news story 312, 322. The sequence of anchor-story-anchor, interspersed with commercials, repeats until the end of the news broadcast, e.g., frame 2500. The repeated appearances of the anchor 311, 312, . . . , typically in the same staged location, serve to clearly identify the start of each reportage segment and the end of the prior news segment or commercial. Also, as standards such as MPEG-7 are developed for describing video content, it can be expected that video streams will contain explicit markers that identify the start and end of independent segments within the streams.

One way of identifying news story segments is provided in EP Patent No. 1 057 129 A1, "Personalized Video Classification and Retrieval System", issued on Dec. 6, 2000 to Elenbaas, J H; Dimitrova, N; Mcgee, T; Simpson, M; Martino, J; Abdel-Mottaleb, M; Garrett, M; Ramsey, C; Desai, R., the entire disclosure of which is incorporated herein by reference.

Referring now to FIG. 3 c, the audio sub-stream 305 is shown. Audio identification in the input multimedia stream 101 is relatively straightforward in that the audio identifier sub-module 103 b pre-selects an audio boundary, e.g., 20 ms in the exemplary embodiment, and divides the input multimedia stream 101 into a plurality of 20 ms TIME frames 304 from start to finish to construct the audio sub-stream 305.
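As a minimal sketch of this 20 ms framing, the following Python fragment splits a mono audio signal into fixed-length TIME frames. The function name, signature and the use of NumPy are assumptions made for illustration only, not the disclosed implementation of the AI sub-module 103 b.

```python
import numpy as np

def split_audio_frames(samples: np.ndarray, sample_rate: int,
                       frame_ms: float = 20.0) -> list:
    """Divide a mono audio signal into consecutive fixed-length TIME frames."""
    frame_len = int(sample_rate * frame_ms / 1000.0)   # samples per TIME frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

# Example: one second of audio at 16 kHz yields 50 TIME frames of 320 samples each.
frames = split_audio_frames(np.zeros(16000), 16000)
assert len(frames) == 50 and len(frames[0]) == 320
```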

Referring again to FIG. 1, the input multimedia stream 101 is received by the MRAD module 103 and processed by the audio identifier (AI) sub-module 103 b to output the audio sub-stream 305.

Referring now to FIG. 3 d, the text sub-stream 307 is shown. Text identification is relatively straightforward in that the text identifier defines a frame 308 on word boundaries identified within the text sub-stream 307.

Referring again to FIG. 1, the input multimedia stream 101 is received by the MRAD module 103 and processed by the text identifier (TI) sub-module 103 c to output the text sub-stream 307. The video 303, audio 305 and text 307 sub-streams, output from the MRAD module 103, are thereafter provided as input streams to the key element identifier (KEI) module 105.

At act 220, a key element identification analysis is performed by the KEI module 105 on the input sub-streams 303, 305, 307 to identify key elements within each respective sub-stream. A key element may be generally defined as a 'segment' of a sub-stream 303, 305, 307 that meets or exceeds a pre-determined criterion, as will be described further below.

At act 225, those key elements identified at act 220 are further filtered, whereby only those key elements whose importance value computed at act 220 meets or exceeds a pre-determined criterion are retained. The key element filter (KEF) module 111 of FIG. 1 performs this filtering process.

At act 230, the user profile filter (UPF) module 113 of FIG. 1 first determines whether the user has previously provided a user profile, which is comprised of a number of user customization parameters, preferably stored in the user preference database 117. At act 232, if a user profile exists, it will be used to further filter those surviving key elements from act 225.

The user profile may be comprised of a number of user-provided customization (preference) parameters. The parameters may be provided either at run time or, preferably, retrieved from the user preference database 117 by the UPF module 113, to indicate particular customization preferences of the user regarding how the multimedia summary 120 is to be presented. In the case where the customization parameters are retrieved from the user preference database 117, users of the system will typically store their preferences with the system 100 during a configuration stage. The customization parameters determine to some extent how the multimedia summary 120 is to be customized to suit the user's particular viewing preferences.

The customization parameters provided by a user may include, for example (a hypothetical profile record illustrating these parameters is sketched after the list below):

-   whether the multimedia summary 120 is to be comprehensive or quick;
-   whether the multimedia summary 120 should include only text, audio, still images, video or combinations thereof;
-   tasks to be performed, such as browsing for new videos vs. recalling an already seen movie;
-   the venue where the summary 120 is to be viewed (i.e., context);
-   the time of day, week, month or year at which the multimedia summary 120 is being viewed;
-   one or more "keyword" customization parameters provided by the user to identify particular items of interest to the user (e.g., persons, places or things). As one example, a user may specify the keywords "Politics" and "Baseball", which will be used by the video summarization system 100 to locate news story segments that emphasize the selected keywords.
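By way of illustration only, such a profile might be stored in the user preference database 117 as a simple record. The field names and values below are hypothetical; the text does not prescribe a schema.

```python
# Hypothetical user-profile record for the user preference database 117.
user_profile = {
    "summary_style": "quick",              # "quick" or "comprehensive"
    "modalities": ["audio", "text"],       # text, audio, still images, video
    "task": "browse",                      # e.g. browsing vs. recalling a seen video
    "venue": "commute",                    # viewing context
    "time_of_day": "morning",
    "keywords": ["Politics", "Baseball"],  # used to select matching story segments
}
```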

By way of example only, if a user prefers that the multimedia summary 120 be restricted to just an audio summary, then a highest-rated audio segment can be chosen from the audio sub-stream 305 and presented to the user. As a further example, if the user prefers to view a quick multimedia summary 120 (e.g., a two-minute news summary), then the news stories that the user is interested in are chosen in accordance with the user profile preference, and from within each selected news story only the highest-rated video, audio and text segments are selected from the respective video 303, audio 305 and text 307 sub-streams to construct a time-constrained multimedia summary 120.

At act 235, those key elements surviving the previous act of user profile filtering are further filtered by considering network and device constraints. Specifically, the network and device constraint (NADC) module 115 considers any bandwidth constraints of the network over which the multimedia summary 120 is to be transmitted and additionally considers those constraints associated with the user's viewing device. The surviving key elements from act 230 are modified in accordance with any known network and device constraints, as will be described.

In the typical case where the multimedia summary 120 is transmitted over a network, such as the Internet, the device constraints and available transmission bandwidth will dictate to some degree the quality and quantity of the multimedia summary 120 to be transmitted. Due to the inherent bandwidth demands of video, the multimedia summary 120 will typically be constrained in the quality and/or quantity of the video portion of the multimedia summary 120. By comparison, the audio and text portions of a multimedia summary 120 will not suffer from similar constraints.

Wireless networks represent a typical limited-bandwidth application. Such limited-bandwidth conditions may exist due to either the direct technological constraints dictated by the use of a low-bandwidth data channel or indirect constraints imposed on relatively high-bandwidth channels by high concurrent user loads. It is contemplated that the network bandwidth may be monitored in a transparent manner in real time to determine the current state of the network. The multimedia summary may be modified in accordance with the prevailing network condition. For example, in the case of a congested network condition, the multimedia summary 120 may be constrained by limiting the video quality of each surviving key element from act 235.

With regard to device constraint considerations, cellular-connected PDAs and webphones are examples of devices that are characteristically limited in processing power, display capabilities, memory, operating systems and the like. As a result of these limitations, these devices have different abilities to receive, process, and display video data. The multimedia summary 120 may be adjusted to accommodate the device constraints by limiting the video resolution, bit-rate, and so on.

If the user device is only capable of rendering text, then the highest-ranking text segments are chosen for each of the news stories and sent out to the device.

At act 240, the multimedia summary 120, comprised of those key elements surviving act 235, is output to the user.

This discussion concludes the overview of the multimedia video summarization system and method. A more detailed description of the operation of various aspects of the method will now be provided.

A top-level description of an embodiment of the method of the invention has been provided above with reference to the flow diagram of FIG. 2, which includes, inter alia, act 220, which is directed to the identification of key elements from the respective video 303, audio 305 and text 307 sub-streams. A more detailed description of act 220, key element identification, is now provided with reference to FIGS. 3-6.

Reference is now made to FIG. 4, which is a detailed flow diagram of the acts which comprise act 220 of the flow diagram of FIG. 2, and also to FIG. 5, which is a diagram further illustrating, by way of non-limiting example only, the process of feature extraction, i.e., the extraction and derivation of features, in each of the three modalities, from the respective sub-streams 303, 305, 307.

Act 220.a—Feature Extraction

At act 220.a, feature extraction is performed whereby low 510, mid 710 and high 910 level features are extracted from each frame in each of the respective video 303, audio 305 and text 307 sub-streams. With regard to the exemplary video sub-stream 303, feature extraction is performed in each of the 2500 video frames which make up the video sub-stream 303, 40 of which are shown for ease of explanation. Similarly, with regard to the audio sub-stream 305, feature extraction is performed in each of the 8000 audio frames 306 (FIG. 3 c) which make up the audio sub-stream 305, 12 of which are shown for ease of explanation. In like manner, with regard to the text sub-stream 307, feature extraction is performed in each of the 6500 text frames 308 (FIG. 3 d) which make up the text sub-stream 307, 5 of which are shown for ease of explanation.

Some examples of low, mid and high level features which may be extracted from the frames in each of the respective sub-streams (video, audio, text) are now described.

By way of non-limiting example only, the video sub-stream may include the following low 503, mid 703 and high 903 level visual mode features:

Low level visual mode features 503 may include, inter alia, motion value (global motion for the frame or video segment), the total number of edges in a frame and dominant color.

Mid-level visual mode features 703 are derived from the extracted low level visual mode features 503 and may include, inter alia, family histograms, camera motion, frame detail, face, presence of overlaid text, and other object detectors.

High level visual mode features 903 are derived from the derived mid-level visual mode features 703 and may include, inter alia, various video frame classifications such as an anchor frame, a reportage frame, an indoor frame, an outdoor frame, a natural frame, a graphics frame, a landscape frame and a cityscape frame.

By way of non-limiting example only, the audio sub-stream 305 may include the following low 505, mid 705 and high 905 level audio mode features:

Low-level audio mode features 505 may include, for example, MFCC (mel-frequency cepstral coefficients), LPC (linear prediction coefficients), average energy, bandwidth, pitch, etc.

Mid-level audio mode features 705 are derived from the extracted low level audio mode features 505 and may include, for example, classification of the audio into speech, music, silence, noise, speech+speech, speech+noise, and speech+music.

High level audio mode features 905 are derived from the previously derived mid-level audio mode features 705 and may include, inter alia, crowd cheering, speaking, laughing, explosions, sirens and so on. They could also include a speech-to-text transcript.

By way of non-limiting example only, the text sub-stream 307 may include the following low 507, mid 707 and high 907 level text mode features:

Low-level text mode features 507 may include, for example, the presence of keywords, cues, names, places, etc.

Mid-level text mode features 707 are derived from the low level text mode features 507 and may include, for example, topics, categories and important nouns.

High level text mode features 907 are derived from the derived mid-level text mode features 707 and may include, inter alia, question/answer passages, an inference of who is speaking (e.g., news reporter vs. anchor person vs. guest), and so on.

FIG. 5 is a diagram further illustrating, by way of non-limiting example only, the process of feature extraction comprising the extraction and derivation of features, in each of the three modalities, from the respective sub-streams 303, 305, 307. As shown, low level video features 510, such as edge, shape and color 503, are extracted from the video sub-stream 303. One or more of the extracted low level video features 503 may then be used to derive one or more mid-level features 703, such as videotext, faces and family histograms. The mid-level features 703 may then be used in turn to derive one or more high-level visual features 903, such as anchor frame, reportage frame, indoor frame, etc.

With reference to the mid-level visual feature 'family histograms', shown as one element of 703, the derivation and use of this feature is of particular significance in that it is used to segment the video sub-stream 303 into 'segments', as will be described further below. Color is a dominant feature in video and helps in segmenting video from a perceptual point of view. Additionally, the duration of a family histogram also maps directly to the computed 'importance value' of a video segment, as will be described.

The process of deriving family histograms from the extracted low level visual features of the video sub-stream 303 involves an analysis of each video frame of the video sub-stream 303. The analysis is performed to quantize the color information of each video frame into color quantization bins. A simple 9-bin quantization color histogram was experimentally determined to be sufficient to identify the key elements. In a variation of this approach, a more complex 256-bin color histogram may be used, depending upon the application. The simple 9-bin quantization color histogram approach assumes that there will only be slight differences in color variation from frame to frame for each family segment contained within a news story segment. This is true because there is presumed to be substantial frame similarity from frame to frame for a key element. Appreciable color variations, however, will occur from one frame to the next when a scene change occurs, indicating the end of one family segment and the start of another. The color histogram approach detects the appreciable color variations (i.e., low level feature) by a sharp contrast in color histogram values from one frame to the next.
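A minimal sketch of such a 9-bin color quantization histogram follows, assuming each pixel is simply assigned to the nearest of nine reference colors. The bin layout, the reference colors and the NumPy-based function are assumptions made for illustration; the text does not fix a particular quantization scheme.

```python
import numpy as np

REFERENCE_COLORS = np.array([
    [0, 0, 0], [255, 255, 255], [128, 128, 128],     # black, white, gray
    [255, 0, 0], [0, 255, 0], [0, 0, 255],           # red, green, blue
    [255, 255, 0], [0, 255, 255], [255, 0, 255],     # yellow, cyan, magenta
], dtype=float)

def color_histogram_9bin(frame_rgb: np.ndarray) -> np.ndarray:
    """Return a 9-bin histogram (pixel counts) for an H x W x 3 RGB frame."""
    pixels = frame_rgb.reshape(-1, 3).astype(float)
    # distance of every pixel to each of the nine reference colors
    dists = np.linalg.norm(pixels[:, None, :] - REFERENCE_COLORS[None, :, :], axis=2)
    bins = dists.argmin(axis=1)
    return np.bincount(bins, minlength=9)

# Example: a solid red 4 x 4 frame puts all 16 pixels into the "red" bin (index 3).
frame = np.full((4, 4, 3), [255, 0, 0], dtype=np.uint8)
print(color_histogram_9bin(frame))   # [ 0  0  0 16  0  0  0  0  0]
```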

In order to find the degree of similarity between video frames, experiments were conducted with multiple histogram difference measures. In the family histogram computation act, the histogram is computed for each video frame, and a search is then made of the previously computed family histograms to find the closest family histogram match. The comparison between the current histogram, H_(C), and a previous family histogram, H_(P), can be computed using one of the following methods for calculating the histogram difference D.

-   (1) Histogram difference using the L1 distance measure is computed by using the following formula:

$\begin{matrix}{D = {\sum\limits_{i = 1}^{N}\left| {{H_{C}(i)} - {H_{P}(i)}} \right|}} & {{EQ}.\mspace{14mu}\lbrack 1\rbrack}\end{matrix}$

Here, N is the total number of color bins used (9 in our case). The values obtained using this formula range between 0 and twice the maximum number of pixels in the respective images. Since we would like to obtain a percentage of similarity, we normalize the value by dividing by the total number of pixels. The normalized values are between 0 and 1, where values close to 0 mean that the images are similar, and those close to 1 mean that the images are dissimilar.

-   (2) Histogram difference using the L2 distance measure is computed by using the following formula:

$\begin{matrix}{D = \sqrt{\sum\limits_{i = 1}^{N}\left( {{H_{C}(i)} - {H_{P}(i)}} \right)^{2}}} & {{Eq}.\mspace{14mu}\lbrack 2\rbrack}\end{matrix}$

Similarly to case (1) we normalize the values of D.

-   (3) Histogram intersection is computed using the following formula:

$\begin{matrix}{I = \frac{\sum\limits_{i = 1}^{N}{\min\left( {{H_{C}(i)},{H_{P}(i)}} \right)}}{\sum\limits_{i = 1}^{N}{H_{C}(i)}}} & {{Eq}.\mspace{14mu}\lbrack 3\rbrack}\end{matrix}$

The values obtained using this formula range between 0 and 1. Values close to 0 mean that the images are dissimilar, and values close to 1 mean that the images are similar. In order to compare histograms with the same interpretation of similarity, we use D=1−I as a distance measure.

-   (4) The Chi-Square test for two image histograms is computed by using the following formula:

$\begin{matrix}{\chi^{2} = {\sum\limits_{i = 1}^{N}\frac{\left( {{H_{C}(i)} - {H_{P}(i)}} \right)^{2}}{{H_{C}(i)} + {H_{P}(i)}}}} & {{Eq}.\mspace{14mu}\lbrack 4\rbrack}\end{matrix}$

In this case, the values range between 0 and the number of color bins,N, so we normalize with N, i.e. D=χ²/N.

-   (5) Bin-wise histogram intersection is computed using the following formula:

$\begin{matrix}{B = {\sum\limits_{i = 1}^{N}\frac{\min\left( {{H_{C}(i)},{H_{P}(i)}} \right)}{\max\left( {{H_{C}(i)},{H_{P}(i)}} \right)}}} & {{Eq}.\mspace{14mu}\lbrack 5\rbrack}\end{matrix}$

Similarly to histogram intersection, lower values mean dissimilarity and higher values mean that the images are similar. To be consistent with the previous measures, the distance is computed using: D=1−B/N.

Color indexing methods that use histogram information are known in the art (see, for example, the publication by M. Stricker and M. Orengo, entitled "Similarity of color images", in Proc. of the IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases II, Vol. SPIE 2420, 1995).
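The five difference measures above can be summarized in a few lines of code. The sketch below assumes H_C and H_P are NumPy arrays of per-bin pixel counts for two frames with the same total pixel count, and it normalizes each measure into a distance in [0, 1] as described in the text; the function names and the exact choice of normalizer are illustrative assumptions.

```python
import numpy as np

def l1_distance(hc: np.ndarray, hp: np.ndarray) -> float:
    # Eq. [1]: sum of absolute bin differences, normalized by the combined pixel count.
    return float(np.abs(hc - hp).sum()) / (hc.sum() + hp.sum())

def l2_distance(hc: np.ndarray, hp: np.ndarray) -> float:
    # Eq. [2]: Euclidean distance between histograms, normalized as in case (1).
    return float(np.sqrt(((hc - hp) ** 2).sum())) / (hc.sum() + hp.sum())

def intersection_distance(hc: np.ndarray, hp: np.ndarray) -> float:
    # Eq. [3]: D = 1 - I, where I is the histogram intersection.
    return 1.0 - float(np.minimum(hc, hp).sum()) / hc.sum()

def chi_square_distance(hc: np.ndarray, hp: np.ndarray) -> float:
    # Eq. [4]: chi-square statistic, normalized by the number of bins N.
    denom = (hc + hp).astype(float)
    terms = np.divide((hc - hp) ** 2, denom, out=np.zeros_like(denom), where=denom > 0)
    return float(terms.sum()) / len(hc)

def binwise_intersection_distance(hc: np.ndarray, hp: np.ndarray) -> float:
    # Eq. [5]: D = 1 - B/N, where B is the bin-wise intersection.
    hi = np.maximum(hc, hp).astype(float)
    b = np.divide(np.minimum(hc, hp), hi, out=np.zeros_like(hi), where=hi > 0)
    return 1.0 - float(b.sum()) / len(hc)
```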

Act 220.b—Assigning Feature Importance Values

At act 220.b, those mid 710 and high 910 level features extracted at act 220.a in each frame from each of the respective sub-streams 303, 305, 307 are now assigned a corresponding feature importance value. Discrete and/or continuous feature analysis methods may be employed to assign such importance values. In the discrete case, the feature analysis method outputs a discrete importance value indicating the presence or lack of presence of a feature (i.e., importance value=1 for feature present, 0 for feature not present), or (importance value=1 if desirable for inclusion in the multimedia summary 120, 0 if not desirable in the summary 120, and 0.5 if in between). As one example, because it is desirable to have 'faces' in the multimedia summary 120, a feature importance value of 1 may be assigned if one or two faces are present, a value of 0 may be assigned if no faces are present, and a value of 0.5 may be assigned in the case where more than two faces are present. Another discrete example may be to assign a 0 for the presence of an anchor and a 1 for the presence of a reportage passage. Yet another discrete example may be to assign 0 for a frame if it belongs to a family histogram whose duration is smaller than n% of the total duration of the news story, and otherwise assign a value of 1; here, n could be set to 10, for example.

With regard to the audio sub-stream 305, it may be desirable to have speech in the multimedia summary 120, so an importance value could be set to 1 for the presence of speech, 0 for noise and silence, and 0.5 for {music, speech+music, speech+speech, speech+noise}.

With regard to the text sub-stream 307, if there is a name or important keyword present, then the importance value may be set to 1; otherwise it is set to 0.
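These discrete assignments can be expressed directly as small scoring functions. The sketch below simply encodes the example values quoted above; the function names and signatures are illustrative assumptions, not part of the disclosure.

```python
def face_importance(num_faces: int) -> float:
    # 1 for one or two faces, 0 for no faces, 0.5 for more than two faces.
    if num_faces == 0:
        return 0.0
    return 1.0 if num_faces <= 2 else 0.5

def audio_importance(audio_class: str) -> float:
    # 1 for speech, 0 for noise or silence, 0.5 for the remaining classes.
    if audio_class == "speech":
        return 1.0
    if audio_class in ("noise", "silence"):
        return 0.0
    return 0.5  # music, speech+music, speech+speech, speech+noise

def text_importance(has_name_or_keyword: bool) -> float:
    # 1 if a name or important keyword is present, 0 otherwise.
    return 1.0 if has_name_or_keyword else 0.0
```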

In a continuous case, for example that of a family histogram, the importance value could be set to the duration of the segment to which a frame belongs divided by the total duration of the news story.

Alternatively, in the continuous case, the feature analysis methods may employ a probability distribution to assign importance values to extracted features. The probability distribution gives the probability that the feature is present in the summary. The feature analysis methods used with this approach may output a probability value which can range from 0 to 1, indicating a degree of confidence regarding the presence of a feature.

The probability distribution for deriving importance values in the continuous case can be derived from a normal Gaussian distribution. Alternatively, the importance values could also be mapped as Poisson, Rayleigh, or Bernoulli distributions. Equation (6) illustrates, by way of example, one way of computing the feature value for a frame as a normal Gaussian distribution.

$\begin{matrix}{{P\left( s \middle| \theta \right)} = {\sqrt{\frac{\theta_{2}}{2\pi}}{e}^{{- {({1/2})}}{\theta_{2}{{({x - \theta_{1}})}^{2}}}}}} & {{Eq}.\mspace{14mu}(6)}\end{matrix}$

where:

-   s is the probability that the feature is in the summary;
-   θ generally represents any of the features;
-   θ₁ is the average of the feature value; and
-   θ₂ is the expected deviation.

As one example, if "faces" represents a mid level video feature to be considered, i.e., represented as θ in equation (6), then very small and very large faces will rarely appear. Most often, whenever a "face" appears in the video stream, it is typically present at a height of substantially 50% of the screen height. In this case θ₁ is equal to 0.5 (the mean) and θ₂ is equal to 0.2, for example. It is noted that a maximum likelihood estimation approach can be used to determine the parameters θ₁ and θ₂.
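For concreteness, the sketch below evaluates Eq. (6) literally, using the face-height example quoted above (θ₁ = 0.5, θ₂ = 0.2). Note that the text calls θ₂ the "expected deviation", so it could also be interpreted as a standard deviation entering the exponent as 1/θ₂²; the code follows the formula exactly as written, and the numbers are illustrative only.

```python
import math

def gaussian_importance(x: float, theta1: float, theta2: float) -> float:
    # Literal evaluation of Eq. (6): sqrt(theta2 / (2*pi)) * exp(-0.5 * theta2 * (x - theta1)^2)
    return math.sqrt(theta2 / (2 * math.pi)) * math.exp(-0.5 * theta2 * (x - theta1) ** 2)

# A face spanning 50% of the screen height scores the peak value;
# a face spanning only 10% scores slightly less.
print(gaussian_importance(0.5, 0.5, 0.2))   # ~0.178
print(gaussian_importance(0.1, 0.5, 0.2))   # ~0.176
```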

It is noted that each of the features can potentially raise or lower the importance value of a key element for potential selection in the multimedia summary 120.

Act 220.c—Compute Importance Values Per Frame in Each Modality

At act 220.c, based on the feature importance values computed at act 220.b, frame importance values are computed. To determine the frame importance values, either a weighted sum approach or polling of the importance values of the extracted features may be utilized, as will be described.

Tables 1, 2 and 3 illustrate, by way of non-limiting example only, the feature importance values computed at act 220.b for each of the extracted features identified at act 220.a in each of the respective modalities (video, audio, text). The importance values are used to compute the importance value per frame. The table column headings represent previously extracted and derived low, mid and high level features such as edges, color, faces, silence, indoor frame and so on.

TABLE 1
Visual Feature Probabilities

            Visual      Visual       Visual              Visual
            feature I   feature II   feature III  . . .  feature N
Frame 1     .8          .6           .9                  .1
Frame 2     .5          .3           .4                  .4
Frame 3     .6          .5           .8                  .9
. . .
Frame A     .2          .001         .4                  .3

TABLE 2
Audio Feature Probabilities

            Audio       Audio        Audio               Audio
            feature I   feature II   feature III  . . .  feature M
Time 1      .5          .6           .9                  .1
Time 2      .15         .83          .4                  .4
Time 3      .6          .5           .8                  .9
. . .
Time B      .2          .001         .4                  .3

TABLE 3
Text Feature Probabilities

            Text        Text         Text                Text
            feature I   feature II   feature III  . . .  feature O
Word 1      .5          .6           .9                  .1
Word 2      .15         .83          .4                  .4
Word 3      .6          .5           .8                  .9
. . .
Word C      .2          .001         .4                  .3

The table values are combined in a manner to be described to provide a measure of how much a frame is "worth". A frame's "worth" is a measure of the frame's significance for possible inclusion in the multimedia summary 120. A frame's "worth" may be computed in any number of ways, including deterministically, statistically and via conditional probabilities.

Deterministic Computation of a Frame's ‘Worth’

In one embodiment, a frame's 'worth' may be computed as a deterministic linear function of low, mid and high level video features:

$\begin{matrix}{{Key\_Element\_Importance} = {\sum\limits_{i}{{w_{i}}{f_{i}}}}} & {{Eq}.\mspace{14mu}(7)}\end{matrix}$

where:

-   f_(i) is the value of a particular low, mid or high level feature in the feature vector; and
-   w_(i) is the weight for that feature.

The features f_(i) could be low level features such as motion value (global motion for the frame or video segment), total number of edges and dominant color, or mid level features such as family importance, camera motion, frame detail, face size and overlaid text box size. High level features can be classifications such as anchor/reportage, indoor/outdoor scenes, natural/graphics, and landscape/cityscape. The feature list is not exhaustive and is only provided as exemplary of the types of features which may be included in the importance value computation.

It is noted that the weights, w_(i), associated with each feature can be determined a priori by the summarization system 100 or alternatively determined in accordance with a user preference. For example, if a user wants to hear music in the multimedia summary 120, then the weight value for music can be set to 1. As another example, if the user prefers not to see any videotext in the summary, the absence of videotext in a frame is given an importance of 1, and so on.
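A minimal sketch of the weighted-sum computation of Eq. (7) follows; the feature names, values and weights are invented for the example and carry no significance beyond illustrating the arithmetic.

```python
def frame_importance(features: dict, weights: dict) -> float:
    # Eq. (7): weighted sum of feature values for one frame.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

features = {"motion": 0.3, "faces": 1.0, "overlaid_text": 0.0, "anchor": 0.0}
weights  = {"motion": 0.2, "faces": 0.5, "overlaid_text": 0.1, "anchor": 0.2}
print(frame_importance(features, weights))   # 0.3*0.2 + 1.0*0.5 = 0.56
```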

It is assumed that, for each of the modalities, the feature importance values are combined in some manner to output a key element importance value per frame, using either a single probabilistic or deterministic function, which results in a list such as the non-limiting exemplary list shown in Table 4:

TABLE 4
Importance Value (Per Frame) for Different Modalities

Visual        Visual            Audio         Audio             Text          Text
frame         importance        frame         importance        frame         importance
label         per frame         label         per frame         label         per frame
Frame 1       .8                Time 1        .6                Word 1        .1
Frame 2       .5                Time 2        .3                Word 2        .4
Frame 3       .6                Time 3        .5                Word 3        .9
Frame 4                         Time 4                          Word 4
Frame N       .2                Time M        .001              Word P        .3

In yet another embodiment, a frame's 'worth' may be computed by finding the conditional probability using a Bayesian belief network pattern classification. Bayesian belief network pattern classification is known in the art. See, for example, Pattern Classification (2nd Edition) by Richard O. Duda, Peter E. Hart and David G. Stork, the entire disclosure of which is incorporated herein by reference.

Act 220.d—Segment Creation

At act 220.d, having compiled the frame importance values for each frame in each modality at act 220.c, the frame importance values are used to combine or group the frames into segments for each modality.

Creating Visual Segments

To create visual segments from the respective video frames (i.e., Frame 1, Frame 2, . . . , Frame N) which make up the video sub-stream 303, either a family histogram computation or shot change detection is performed. One way of combining frames into segments is by using shot change detection. Shot change detection is well known and disclosed in U.S. Pat. No. 6,125,229, 26 Sep. 2000, also issued as EP 0 916 120 A2, 19 May 1999, issued to Dimitrova, N; Mcgee, T; Elenbaas, J H, Visual Indexing System, the entire disclosure of which is incorporated herein by reference. Another way of creating visual segments from the respective video frames of the video sub-stream 303 is through the use of family histograms, as discussed above.

Creating Audio Segments

To create audio segments from the respective TIME frames (i.e., TIME 1, TIME 2, and so on) which make up the audio sub-stream 305, the segment boundaries can be the boundaries of different classifications. That is, an audio classifier classifies audio into speech (1), music (2), silence (3), noise (4), speech+speech (5), speech+noise (6), and speech+music (7). FIG. 6 is a graph illustrating, by way of example, how the time elements which comprise the audio sub-stream 305 of FIG. 3 may be grouped to form segments. The graph plots audio classification v. time frames (time frame [x]). As shown, the initial frames (frames 1-20,000) are mostly classified as music (2) frames. Thereafter, successive frames are mostly classified as noise frames (4), followed by speech and music frames (7).
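A minimal sketch of this grouping step is shown below: consecutive TIME frames that share the same classification label (the integer classes 1-7 above) are merged into one segment. The function name and the use of itertools.groupby are assumptions made for illustration.

```python
from itertools import groupby

def group_audio_segments(frame_classes):
    """Return (start_index, end_index, class_label) for each run of equal labels."""
    segments, start = [], 0
    for label, run in groupby(frame_classes):
        length = len(list(run))
        segments.append((start, start + length - 1, label))
        start += length
    return segments

# Example: music (2), then noise (4), then speech+music (7).
print(group_audio_segments([2, 2, 2, 4, 4, 7, 7, 7]))
# [(0, 2, 2), (3, 4, 4), (5, 7, 7)]
```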

The details of audio classification are further described in "Classification of general audio data for content-based retrieval", Pattern Recognition Letters, Vol. 22, No. 5, pages 533-544 (2001), Dongge Li, Ishwar K. Sethi, Nevenka Dimitrova, incorporated by reference herein in its entirety.

Creating Text Segments

To create text segments, the segment boundaries could be defined to be sentence boundaries based on the punctuation provided in the closed-caption portion of the input video sequence 101, 102.

Act 220.e—Segment Importance Value Determination

Segment importance value determination may be performed in one way by averaging the frame importance values of the frames which comprise each segment to generate a single ranking or score. Another way of computing a segment importance value is to take the highest frame importance value within the segment and assign it to the whole segment.
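Both scoring rules reduce to one small function; the sketch below, with its invented function name and example values, is for illustration only.

```python
def segment_importance(frame_values, method="mean"):
    # "mean": average of the frame importance values in the segment.
    # "max":  highest frame importance value, assigned to the whole segment.
    if method == "mean":
        return sum(frame_values) / len(frame_values)
    if method == "max":
        return max(frame_values)
    raise ValueError("method must be 'mean' or 'max'")

print(segment_importance([0.75, 0.5, 0.25]))           # 0.5
print(segment_importance([0.75, 0.5, 0.25], "max"))    # 0.75
```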

Act 220.f—Segment Ranking

At act 220.f, a segment ranking (score) is computed for each segment identified at act 220.d in each of the respective modalities. In addition, the ranked segments are sorted in order of importance based on the computed ranking or score.

Table 6 illustrates, by way of example, how the video segments (col. 1) and their associated segment importance values (col. 2) are ranked. Tables 7 and 8 show a similar construction for the audio and text modalities, respectively.

TABLE 6
Visual Segment Importance Ranking

Visual Segment      Importance Value    Ranking
Frames 1–6          .8                  1
Frames 26–30        .6                  2
Frames 7–25         .5                  3
. . .
Frames (N-23)–N     .2                  N

TABLE 7
Audio Segment Importance Ranking

Audio Segment       Importance Value    Ranking
Frames 30–45        .9                  1
Frames 10–29        .8                  2
Frames 100–145      .6                  3
. . .
Frames (N-10)–N     .2                  J

TABLE 8
Text Segment Importance Ranking

Text Segment        Importance Value    Ranking
Frames 5–65         .9                  1
Frames 13–25        .7                  2
Frames 26–29        .6                  3
. . .
Frames (N-100)–N    .2                  K

Act 220.g—Key Element Identification

At act 220.g, key elements are identified based on the segment rankings of act 220.f.

FIGS. 7 a-c illustrate, by way of example, several ways of identifying key elements. FIGS. 7 a-c are graphs of (importance value) v. (segment) which could represent any of the modalities discussed above, i.e., Tables 6, 7 or 8.

FIG. 7 a is a graph illustrating a first method of identifying key elements. Key elements are identified by selecting any segment which appears above a pre-determined threshold.

FIG. 7 b is a graph illustrating a second method of identifying key elements. Key elements are identified by selecting the local maxima, i.e., "A", "B", "C", which appear above a pre-determined threshold, Th.

FIG. 7 c is a graph illustrating a third method of identifying key elements. Key elements are identified by selecting the first N local maxima without consideration for a threshold criterion.
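The three selection rules of FIGS. 7 a-c can be sketched as follows, operating on a list of segment importance values in temporal order; the function names and the example scores are assumptions made for illustration.

```python
def local_maxima(values):
    # Indices whose value is no smaller than both of its neighbors.
    return [i for i in range(len(values))
            if (i == 0 or values[i] >= values[i - 1])
            and (i == len(values) - 1 or values[i] >= values[i + 1])]

def select_above_threshold(values, th):
    return [i for i, v in enumerate(values) if v >= th]               # FIG. 7a

def select_maxima_above_threshold(values, th):
    return [i for i in local_maxima(values) if values[i] >= th]       # FIG. 7b

def select_top_n_maxima(values, n):
    maxima = local_maxima(values)
    return sorted(maxima, key=lambda i: values[i], reverse=True)[:n]  # FIG. 7c

scores = [0.2, 0.8, 0.75, 0.3, 0.9, 0.1]
print(select_above_threshold(scores, 0.7))         # [1, 2, 4]
print(select_maxima_above_threshold(scores, 0.7))  # [1, 4]
print(select_top_n_maxima(scores, 2))              # [4, 1]
```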

It is noted that the process of identifying key elements described above and illustrated with reference to FIGS. 7 a-c may be further modified in accordance with a user viewing profile. It is well known that recommendation systems generally operate by recommending items to particular users based on information known about the users. Typically, such systems develop profiles of customers based on the customers' previous viewing or buying habits. In the present context, a user's viewing profile can be created and preferably stored in the user preference database 117 along with the other user profile data discussed above. The user's viewing profile may then be used to create a mapping function for mapping the previously described graph of (importance value) v. (segment), as illustrated in FIGS. 7 a-c, to a second function which accounts for the user's viewing preferences. This process is optional and may be implemented for any or all of the modalities.

Obviously, numerous modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that, within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.

1. A method for producing a multimedia summary of at least one multimedia stream, the summary comprising key elements selected from said at least one multimedia stream, the method comprising: a.) one of receiving and retrieving said at least one multimedia stream comprising video, audio and text information; b.) dividing the at least one multimedia stream into a video sub-stream, an audio sub-stream and a text sub-stream; c.) identifying, for potential inclusion in the summary, video, audio and text key elements from said video, audio and text sub-streams, respectively; d.) computing an importance value for the identified video, audio and text key elements identified at said step (c); e.) first filtering the identified video, audio and text key elements to exclude those key elements whose associated importance value is less than a pre-defined video, audio and text importance threshold, respectively; f.) second filtering the remaining key elements from said step (e) in accordance with a user profile; g.) third filtering the remaining key elements from said step (f) in accordance with network and user device constraints; and h.) outputting a multimedia summary which comprises key elements remaining from said step (g).
2. The method of claim 1, wherein said at least one multimedia stream is one of an analog and a digital multimedia stream.
3. The method of claim 1, wherein the step of dividing the at least one multimedia stream into a video sub-stream further comprises the step of identifying and grouping said at least one multimedia stream into a plurality of news stories, where each identified news story is comprised of an anchor portion and a reportage portion.
4. The method of claim 1, wherein the step of dividing the at least one multimedia stream into an audio sub-stream further comprises dividing said at least one multimedia stream into a plurality of equal-sized frames of a fixed time duration.
5. The method of claim 1, wherein the step of dividing the at least one multimedia stream into a text sub-stream further comprises dividing said at least one multimedia stream into a plurality of frames, wherein each frame of said plurality of frames is defined on a word boundary.
6. The method of claim 1, wherein the act of identifying video, audio and text key elements from said video, audio and text sub-streams further comprises the acts of: 1.) identifying low, mid and high level features from the plurality of frames which comprise said video, audio and text sub-streams; 2.) determining an importance value for each of said extracted low, mid and high level features from said identifying act; 3.) computing a frame importance value for each of said plurality of frames which comprise said video, audio and text sub-streams as a function of the feature importance values determined at said determining act; 4.) combining the frames into segments in each of said video, audio and text sub-streams; 5.) computing an importance value per segment for each segment from said combining act; 6.) ranking the segments based on said importance value computed at said computing act; and 7.) identifying key elements based on said ranked segments.
7. The method of claim 6, wherein said act (3) of computing a frame importance value for each of said extracted low, mid and high level features further comprises computing said importance value by one of deterministic, statistical and conditional probability means.
8. The method of claim 7, wherein said conditional probability means comprises computing said frame importance value as one of a Gaussian, Poisson, Rayleigh and Bernoulli distribution.
9. The method of claim 8, further comprising computing the Gaussian distribution according to the equation: ${P\left( s \middle| \theta \right)} = {\sqrt{\frac{\theta_{2}}{2\pi}}{e}^{{- {({1/2})}}{\theta_{2}{{({x - \theta_{1}})}^{2}}}}}$ where: θ is any of the features; θ₁ is the average of the feature value; and θ₂ is the expected deviation.
10. The method of claim 7, wherein said deterministic means comprises computing said frame importance value according to the equation: Σ w_(i)f_(i), where: f_(i) represent low, mid-level and high-level features; and w_(i) represent weighting factors for weighting said features.
11. The method of claim 6, wherein said step (4) of combining the frames into video segments further comprises combining said frames by one of family histogram computation means and shot change detection means.
12. The method of claim 6, wherein said step (4) of combining the frames into audio segments further comprises the steps of: categorizing each frame from said audio sub-stream as one of a speech frame, a music frame, a silence frame, a noise frame, a speech+speech frame, a speech+noise frame and a speech+music frame; and grouping consecutive frames having the same categorization.
13. The method of claim 6, wherein said step (4) of combining the frames into text segments further comprises combining said frames based on punctuation included in said text sub-stream.
14. The method of claim 6, wherein said step (5) of computing an importance value per segment further comprises averaging the frame importance values for those frames which comprise said segment.
15. The method of claim 6, wherein said step (5) of computing an importance value per segment further comprises using the highest frame importance value in said segment.
16. The method of claim 6, wherein said step (7) of identifying key elements based on said rankings further comprises identifying key elements whose segment ranking exceeds a predetermined segment ranking threshold.
17. The method of claim 6, wherein said step (7) of identifying key elements based on said rankings further comprises identifying key elements whose segment ranking both exceeds a predetermined segment ranking threshold and constitutes a local maximum.
18. The method of claim 6, wherein said step (7) of identifying key elements based on said rankings further comprises identifying key elements whose segment ranking constitutes a local maximum.
19. A system for producing a multimedia summary of at least one multimedia stream, the summary comprising key elements selected from said at least one multimedia stream, the system comprising: a modality recognition and division (MRAD) module comprising a story segment identifier (SSI) module, an audio identifier (AI) module and a text identifier (TI) module, the MRAD module communicatively coupled to a first external source for receiving said at least one multimedia stream, the MRAD module communicatively coupled to a second external source for receiving said at least one multimedia stream, the MRAD module dividing said at least one multimedia stream into a video, an audio and a text sub-stream and outputting said video, audio and text sub-streams to a key element identifier (KEI) module, the KEI module comprising a feature extraction (FE) module and an importance value (IV) module for identifying key elements from within said video, audio and text sub-streams and assigning importance values thereto, the KEI module communicatively coupled to a key element filter (KEF) module for receiving the identified key elements and filtering said key elements that exceed a pre-determined threshold criterion, the KEF module communicatively coupled to a user profile filter (UPF) module for receiving filtered key elements and further filtering said filtered key elements in accordance with a user profile, the UPF module communicatively coupled to a network and device constraint (NADC) module, said NADC module receiving said further filtered key elements and further filtering said further filtered key elements in accordance with network and/or user device constraints, the NADC module outputting a multimedia summary of said at least one multimedia stream which comprises key elements remaining after said further filtering in the NADC module.
20. The system of claim 19, further comprising a user preference database communicatively coupled to said UPF module for storing user profiles.
21. The system of claim 19, wherein the first external source is a broadcast channel selector.
22. The system of claim 19, wherein the first external source is a video streaming source.
23. The system of claim 19, wherein said at least one multimedia stream is one of an analog and a digital multimedia stream.
24. The system of claim 19, wherein the NADC module is communicatively connected to an external network coupled to a user device.
25. The system of claim 19, wherein the network is the Internet.
26. A non-transitory computer readable medium having computer readable program code means embodied thereon, said computer readable program code means comprising: an act of one of receiving and retrieving at least one multimedia stream comprising video, audio and text information; an act of dividing said at least one multimedia stream into a video sub-stream, an audio sub-stream and a text sub-stream; an act of identifying, for potential inclusion in the summary, video, audio and text key elements from said video, audio and text sub-streams, respectively; an act of computing an importance value for the identified video, audio and text key elements identified at said identification act; an act of first filtering the identified video, audio and text key elements to exclude those key elements whose associated importance value is less than a pre-defined video, audio and text importance threshold, respectively; an act of second filtering the remaining key elements from said first filtering act in accordance with a user profile; an act of third filtering the remaining key elements from said second filtering act in accordance with network and user device constraints; and an act of outputting a multimedia summary which comprises key elements remaining from said third filtering act.
27. The non-transitory computer readable medium of claim 26, wherein the act of identifying video, audio and text key elements from said video, audio and text sub-streams, respectively, further comprises: an act of identifying low, mid and high level features from the plurality of frames which comprise said video, audio and text sub-streams; an act of determining an importance value for each of said extracted low, mid and high level features from said identifying act; an act of computing a frame importance value for each of said plurality of frames which comprise said video, audio and text sub-streams as a function of the feature importance values determined at said determining act; an act of combining the frames into segments in each of said video, audio and text sub-streams; an act of computing an importance value per segment for each segment from said combining act; an act of ranking the segments based on said importance value computed at said computing act; and an act of identifying key elements based on said ranked segments.